2022-11-15

The curse of dimensionality

  • We described how methods such as LDA and QDA are not meant to be used with many predictors \(p\) because the number of parameters that we need to estimate becomes too large.

  • For example, with the digits example \(p=784\), we would have over 600,000 parameters with LDA, and we would multiply that by the number of classes for QDA.

  • Kernel methods such as kNN or local regression do not have model parameters to estimate.

  • However, they also face a challenge when multiple predictors are used due to what is referred to as the curse of dimensionality.

The curse of dimensionality

  • The dimension here refers to the fact that when we have \(p\) predictors, the distance between two observations is computed in \(p\)-dimensional space.

  • A useful way of understanding the curse of dimensionality is by considering how large we have to make a span/neighborhood/window to include a given percentage of the data.

  • Remember that with larger neighborhoods, our methods lose flexibility.

  • For example, suppose we have one continuous predictor with equally spaced points in the [0,1] interval and we want to create windows that include 1/10th of data.

The curse of dimensionality

  • Then it’s easy to see that our windows have to be of size 0.1:

The curse of dimensionality

  • Now, for two predictors, if we decide to keep the neighborhood just as small, 10% for each dimension, we include only 1 point.

  • If we want to include 10% of the data, then we need to increase the size of each side of the square to \(\sqrt{.10} \approx .316\):

The curse of dimensionality

  • Using the same logic, if we want to include 10% of the data in a three-dimensional space, then the side of each cube is \(\sqrt[3]{.10} \approx 0.464\).

  • In general, to include 10% of the data in a case with \(p\) dimensions, we need an interval with each side of size \(\sqrt[p]{.10}\) of the total.

The curse of dimensionality

  • This proportion gets close to 1 quickly, and if the proportion is 1 it means we include all the data and are no longer smoothing.

The curse of dimensionality

  • By the time we reach 100 predictors, the neighborhood is no longer very local, as each side covers almost the entire dataset.

  • Here we look at a set of elegant and versatile methods that adapt to higher dimensions and also allow these regions to take more complex shapes while still producing models that are interpretable.

  • These are very popular, well-known and studied methods.

  • We will concentrate on regression and decision trees and their extension to random forests.

CART motivation

  • To motivate this section, we will use a new dataset.

  • that includes the breakdown of the composition of olive oil into 8 fatty acids:

library(tidyverse) 
library(dslabs) 
data("olive") 
names(olive) 
#>  [1] "region"      "area"        "palmitic"    "palmitoleic" "stearic"    
#>  [6] "oleic"       "linoleic"    "linolenic"   "arachidic"   "eicosenoic"

CART motivation

  • For illustrative purposes, we will try to predict the region using the fatty acid composition values as predictors.
table(olive$region) 
#> 
#> Northern Italy       Sardinia Southern Italy 
#>            151             98            323

CART motivation

  • We remove the area column because we won’t use it as a predictor.
olive <- select(olive, -area) 
  • Let’s very quickly try to predict the region using kNN:
library(caret) 
fit <- train(region ~ .,  method = "knn",  
             tuneGrid = data.frame(k = seq(1, 15, 2)),  
             data = olive) 
ggplot(fit) 

CART motivation

CART motivation

  • We see that using just one neighbor, we can predict relatively well.

  • However, a bit of data exploration reveals that we should be able to do even better.

  • For example, if we look at the distribution of each predictor stratified by region we see that eicosenoic is only present in Southern Italy and that linoleic separates Northern Italy from Sardinia.

CART motivation

CART motivation

  • This implies that we should be able to build an algorithm that predicts perfectly! We can see this clearly by plotting the values for eicosenoic and linoleic.

CART motivation

CART motivation

  • The predictor space here consists of eight-dimensional points with values between 0 and 100.

  • In the previoui plot, we showed the space defined by the two predictors eicosenoic and linoleic, and, by eye, we can construct a prediction rule that partitions the predictor space so that each partition contains only outcomes of a one category.

  • This in turn can be used to define an algorithm with perfect accuracy.

CART motivation

  • Specifically, we define the following decision rule.

  • If eicosenoic is larger than 0.065, predict Southern Italy.

  • If not, then if linoleic is larger than \(10.535\), predict Sardinia, and if lower, predict Northern Italy.

CART motivation

  • We can draw this decision tree like this:

CART motivation

  • Decision trees like this are often used in practice.

  • For example, to decide on a person’s risk of poor outcome after having a heart attack, doctors use the following:

CART motivation

  • A tree is basically a flow chart of yes or no questions.

  • The general idea of the methods we are describing is to define an algorithm that uses data to create these trees with predictions at the ends, referred to as nodes.

  • Regression and decision trees operate by predicting an outcome variable \(Y\) by partitioning the predictors.

Regression trees

  • When the outcome is continuous, we call the method a regression tree.

  • To introduce regression trees, we will use the 2008 poll data used in previous sections to describe the basic idea of how we build these algorithms.

  • As with other machine learning algorithms, we will try to estimate the conditional expectation \(f(x) = \mbox{E}(Y | X = x)\) with \(Y\) the poll margin and \(x\) the day.

data("polls_2008") 
qplot(day, margin, data = polls_2008) 

Regression trees

Regression trees

  • The general idea here is to build a decision tree and, at the end of each node, obtain a predictor \(\hat{y}\).

  • A mathematical way to describe this is to say that we are partitioning the predictor space into \(J\) non-overlapping regions, \(R_1, R_2, \ldots, R_J\), and then for any predictor \(x\) that falls within region \(R_j\), estimate \(f(x)\) with the average of the training observations \(y_i\) for which the associated predictor \(x_i\) is also in \(R_j\).

Regression trees

  • But how do we decide on the partition \(R_1, R_2, \ldots, R_J\) and how do we choose \(J\)? Here is where the algorithm gets a bit complicated.

  • Regression trees create partitions recursively.

  • We start the algorithm with one partition, the entire predictor space.

  • In our simple first example, this space is the interval [-155, 1].

  • But after the first step we will have two partitions.

  • After the second step we will split one of these partitions into two and will have three partitions, then four, then five, and so on.

Regression trees

  • We describe how we pick the partition to further partition, and when to stop, later.

  • Once we select a partition \(\mathbf{x}\) to split in order to create the new partitions, we find a predictor \(j\) and value \(s\) that define two new partitions, which we will call \(R_1(j,s)\) and \(R_2(j,s)\), that split our observations in the current partition by asking if \(x_j\) is bigger than \(s\):

\[ R_1(j,s) = \{\mathbf{x} \mid x_j < s\} \mbox{ and } R_2(j,s) = \{\mathbf{x} \mid x_j \geq s\} \]

Regression trees

  • In our current example we only have one predictor, so we will always choose \(j=1\), but in general this will not be the case.

  • Now, after we define the new partitions \(R_1\) and \(R_2\), and we decide to stop the partitioning, we compute predictors by taking the average of all the observations \(y\) for which the associated \(\mathbf{x}\) is in \(R_1\) and \(R_2\).

  • We refer to these two as \(\hat{y}_{R_1}\) and \(\hat{y}_{R_2}\) respectively.

Regression trees

  • But how do we pick \(j\) and \(s\)? Basically we find the pair that minimizes the residual sum of squares (RSS):

\[ \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]

  • This is then applied recursively to the new regions \(R_1\) and \(R_2\).

  • We describe how we stop later, but once we are done partitioning the predictor space into regions, in each region a prediction is made using the observations in that region.

Regression trees

  • Let’s take a look at what this algorithm does on the 2008 presidential election poll data.

  • We will use the rpart function in the rpart package.

library(rpart) 
fit <- rpart(margin ~ ., data = polls_2008) 
  • Here, there is only one predictor.

  • Thus we do not have to decide which predictor \(j\) to split by, we simply have to decide what value \(s\) we use to split.

Regression trees

  • We can visually see where the splits were made:
plot(fit, margin = 0.1) 
text(fit, cex = 0.75) 

Regression trees

  • The first split is made on day 39.5.

  • One of those regions is then split at day 86.5.

  • The two resulting new partitions are split on days 49.5 and 117.5, respectively, and so on.

  • We end up with 8 partitions.

Regression trees

  • The final estimate \(\hat{f}(x)\) looks like this:

Regression trees

  • Note that the algorithm stopped partitioning at 8.

  • Now we explain how this decision is made.

  • First we need to define the term complexity parameter (cp).

  • Every time we split and define two new partitions, our training set RSS decreases.

  • This is because with more partitions, our model has more flexibility to adapt to the training data.

Regression trees

  • In fact, if you split until every point is its own partition, then RSS goes all the way down to 0 since the average of one value is that same value.

  • To avoid this, the algorithm sets a minimum for how much the RSS must improve for another partition to be added.

  • This parameter is referred to as the complexity parameter (cp).

  • The RSS must improve by a factor of cp for the new partition to be added.

Regression trees

  • Large values of cp will therefore force the algorithm to stop earlier which results in fewer nodes.

  • However, cp is not the only parameter used to decide if we should partition a current partition or not.

  • Another common parameter is the minimum number of observations required in a partition before partitioning it further.

  • The argument used in the rpart function is minsplit and the default is 20.

Regression trees

  • The rpart implementation of regression trees also permits users to determine a minimum number of observations in each node.

  • The argument is minbucket and defaults to round(minsplit/3).

Regression trees

  • As expected, if we set cp = 0 and minsplit = 2, then our prediction is as flexible as possible and our predictor is our original data:

Regression trees

  • Intuitively we know that this is not a good approach as it will generally result in over-training.

  • These cp, minsplit, and minbucket, three parameters can be used to control the variability of the final predictors.

  • The larger these values are the more data is averaged to compute a predictor and thus reduce variability.

  • The drawback is that it restricts flexibility.

  • So how do we pick these parameters? We can use cross validation, just like with any tuning parameter.

Regression trees

  • Here is an example of using cross validation to choose cp.
library(caret) 
train_rpart <- train(margin ~ .,  
                     method = "rpart", 
                     tuneGrid = data.frame(cp = seq(0, 0.1, len = 25)), 
                     data = polls_2008) 
ggplot(train_rpart) 

Regression trees

Regression trees

  • To see the resulting tree, we access the finalModel and plot it:
plot(train_rpart$finalModel, margin = 0.1) 
text(train_rpart$finalModel, cex = 0.75) 

Regression trees

  • And because we only have one predictor, we can actually plot \(\hat{f}(x)\):

Regression trees

  • Note that if we already have a tree and want to apply a higher cp value, we can use the prune function.

  • We call this pruning a tree because we are snipping off partitions that do not meet a cp criterion.

  • We previously created a tree that used a cp = 0 and saved it to fit.

  • We can prune it like this:

pruned_fit <- prune(fit, cp = 0.01) 

Classification (decision) trees

  • Classification trees, or decision trees, are used in prediction problems where the outcome is categorical.

  • We use the same partitioning principle with some differences to account for the fact that we are now working with a categorical outcome.

  • The first difference is that we form predictions by calculating which class is the most common among the training set observations within the partition, rather than taking the average in each partition (as we can’t take the average of categories).

Classification (decision) trees

  • The second is that we can no longer use RSS to choose the partition.

  • While we could use the naive approach of looking for partitions that minimize training error, better performing approaches use more sophisticated metrics.

  • Two of the more popular ones are the Gini Index and Entropy.

  • In a perfect scenario, the outcomes in each of our partitions are all of the same category since this will permit perfect accuracy.

Classification (decision) trees

  • The Gini Index is going to be 0 in this scenario, and become larger the more we deviate from this scenario.

  • To define the Gini Index, we define \(\hat{p}_{j,k}\) as the proportion of observations in partition \(j\) that are of class \(k\).

  • The Gini Index is defined as.

\[ \mbox{Gini}(j) = \sum_{k=1}^K \hat{p}_{j,k}(1-\hat{p}_{j,k}) \]

Classification (decision) trees

  • If you study the formula carefully you will see that it is in fact 0 in the perfect scenario described above.

  • Entropy is a very similar quantity, defined as.

\[ \mbox{entropy}(j) = -\sum_{k=1}^K \hat{p}_{j,k}\log(\hat{p}_{j,k}), \mbox{ with } 0 \times \log(0) \mbox{ defined as }0 \]

Classification (decision) trees

  • Let us look at how a classification tree performs on the digits example we examined before by using this code to run the algorithm and plot the resulting accuracy.

Classification (decision) trees

train_rpart <- train(y ~ ., 
                     method = "rpart", 
                     tuneGrid = data.frame(cp = seq(0.0, 0.1, len = 25)), 
                     data = mnist_27$train) 
plot(train_rpart) 

Classification (decision) trees

Classification (decision) trees

  • The accuracy achieved by this approach is better than what we got with regression, but is not as good as what we achieved with kernel methods:
y_hat <- predict(train_rpart, mnist_27$test) 
confusionMatrix(y_hat, mnist_27$test$y)$overall["Accuracy"] 
#> Accuracy 
#>     0.82
  • The plot of the estimated conditional probability shows us the limitations of classification trees:

Classification (decision) trees

Classification (decision) trees

  • Note that with decision trees, it is difficult to make the boundaries smooth since each partition creates a discontinuity.

  • Classification trees have certain advantages that make them very useful.

  • They are highly interpretable, even more so than linear models.

  • They are easy to visualize (if small enough).

  • Finally, they can model human decision processes and don’t require use of dummy predictors for categorical variables.

Classification (decision) trees

  • On the other hand, the approach via recursive partitioning can easily over-train and is therefore a bit harder to train than, for example, linear regression or kNN.

  • Furthermore, in terms of accuracy, it is rarely the best performing method since it is not very flexible and is highly unstable to changes in training data.

  • Random forests, explained next, improve on several of these shortcomings.

Random forests

  • Random forests are a very popular machine learning approach that addresses the shortcomings of decision trees using a clever idea.

  • The goal is to improve prediction performance and reduce instability by averaging multiple decision trees (a forest of trees constructed with randomness).

  • It has two features that help accomplish this.

  • The first step is bootstrap aggregation or bagging.

Random forests

  • The general idea is to generate many predictors, each using regression or classification trees, and then forming a final prediction based on the average prediction of all these trees.

  • To assure that the individual trees are not the same, we use the bootstrap to induce randomness.

  • These two features combined explain the name: the bootstrap makes the individual trees randomly different, and the combination of trees is the forest.

Random forests

  • The specific steps are as follows.

1. Build \(B\) decision trees using the training set.

  • We refer to the fitted models as \(T_1, T_2, \dots, T_B\).

  • We later explain how we ensure they are different.

2. For every observation in the test set, form a prediction \(\hat{y}_j\) using tree \(T_j\).

Random forests

3. For continuous outcomes, form a final prediction with the average \(\hat{y} = \frac{1}{B} \sum_{j=1}^B \hat{y}_j\).

  • For categorical data classification, predict \(\hat{y}\) with majority vote (most frequent class among \(\hat{y}_1, \dots, \hat{y}_T\)).

  • So how do we get different decision trees from a single training set? For this, we use randomness in two ways which we explain in the steps below.

  • Let \(N\) be the number of observations in the training set.

Random forests

  • To create \(T_j, \, j=1,\ldots,B\) from the training set we do the following:

1. Create a bootstrap training set by sampling \(N\) observations from the training set with replacement.

  • This is the first way to induce randomness.

2. A large number of features is typical in machine learning challenges.

Random forests

  • Often, many features can be informative but including them all in the model may result in overfitting.

  • The second way random forests induce randomness is by randomly selecting features to be included in the building of each tree.

  • A different random subset is selected for each tree.

  • This reduces correlation between trees in the forest, thereby improving prediction accuracy.

Random forests

  • To illustrate how the first steps can result in smoother estimates we will demonstrate by fitting a random forest to the 2008 polls data.

  • We will use the randomForest function in the randomForest package:

library(randomForest) 
fit <- randomForest(margin~., data = polls_2008)  
  • Note that if we apply the function plot to the resulting object, stored in fit, we see how the error rate of our algorithm changes as we add trees.

Random forests

plot(fit) 

Random forests

  • We can see that in this case, the accuracy improves as we add more trees until about 30 trees where accuracy stabilizes.

  • The resulting estimate for this random forest can be seen like this:

polls_2008 |> 
  mutate(y_hat = predict(fit, newdata = polls_2008)) |>  
  ggplot() + 
  geom_point(aes(day, margin)) + 
  geom_line(aes(day, y_hat), col="red") 

Random forests

Random forests

  • Notice that the random forest estimate is much smoother than what we achieved with the regression tree in the previous section.

  • This is possible because the average of many step functions can be smooth.

  • We can see this by visually examining how the estimate changes as we add more trees.

Random forests

  • In the following figure you see each of the bootstrap samples for several values of \(b\) and for each one we see the tree that is fitted in grey, the previous trees that were fitted in lighter grey, and the result of averaging all the trees estimated up to that point.

Random forests

Random forests

  • Here is the random forest fit for our digits example based on two predictors:
library(randomForest) 
train_rf <- randomForest(y ~ ., data=mnist_27$train) 
confusionMatrix(predict(train_rf, mnist_27$test), 
                mnist_27$test$y)$overall["Accuracy"] 
#> Accuracy 
#>     0.79

Random forests

  • Here is what the conditional probabilities look like:

Random forests

  • Visualizing the estimate shows that, although we obtain high accuracy, it appears that there is room for improvement by making the estimate smoother.

  • This could be achieved by changing the parameter that controls the minimum number of data points in the nodes of the tree.

  • The larger this minimum, the smoother the final estimate will be.

  • We can train the parameters of the random forest.

Random forests

  • Below, we use the caret package to optimize over the minimum node size.

  • Because, this is not one of the parameters that the caret package optimizes by default we will write our own code:

nodesize <- seq(1, 51, 10) 
acc <- sapply(nodesize, function(ns){ 
  train(y ~ ., method = "rf", data = mnist_27$train, 
               tuneGrid = data.frame(mtry = 2), 
               nodesize = ns)$results$Accuracy 
}) 
qplot(nodesize, acc) 

Random forests

Random forests

  • We can now fit the random forest with the optimized minimun node size to the entire training data and evaluate performance on the test data.
train_rf_2 <- randomForest(y ~ ., data=mnist_27$train, 
                           nodesize = nodesize[which.max(acc)]) 
confusionMatrix(predict(train_rf_2, mnist_27$test), 
                mnist_27$test$y)$overall["Accuracy"] 
#> Accuracy 
#>    0.815
  • The selected model improves accuracy and provides a smoother estimate.

Random forests

Random forests

  • Note that we can avoid writing our own code by using other random forest implementations as described in the caret manual.

  • Random forest performs better in all the examples we have considered.

  • However, a disadvantage of random forests is that we lose interpretability.

  • An approach that helps with interpretability is to examine variable importance.

Random forests

  • To define variable importance we count how often a predictor is used in the individual trees.

  • You can learn more about variable importance in an advanced machine learning book.

  • The caret package includes the function varImp that extracts variable importance from any model in which the calculation is implemented.

  • We give an example on how we use variable importance in a future lecture.